With both the "Negative Sampling" and the "Subsampling of Frequent Words" techniques, the sampling is based on word frequency. That is, words are sampled differently depending on how many times they appear in the training data. In this notebook, we will look at some actual word frequency data and see how the word2vec sampling functions behave on this data.
Let's start by checking out some word frequency data--we'll use word counts which were extracted from a full dump of Wikipedia.
GitHub user IlyaSemenov was nice enough to calculate the word occurrence counts for every word in Wikipedia and share the results in this project.
Each word is followed by its count, separated by a single space, with one word per line. You can download the file directly from here, or use the code below to grab it for you.
import requests
import os
# Create the data subdirectory if not there.
if not os.path.exists('./data/'):
    os.mkdir('./data/')
# URL for the text file (~22MB) containing words and
# their counts.
url = 'https://raw.githubusercontent.com/IlyaSemenov/' \
'wikipedia-word-frequency/master/results/' \
'enwiki-20150602-words-frequency.txt'
print('Downloading Wikipedia word frequencies...')
# Download the file to the `./data/` subdirectory.
r = requests.get(url, allow_redirects=True)
with open('./data/enwiki-20150602-words-frequency.txt', 'wb') as f:
    f.write(r.content)
print(' Done.')
Let's start by reading in the words and their counts as two separate lists. Conveniently, they have already been sorted by descending frequency.
%%time
words = []
word_counts = []
print('Reading in word counts...')
# Open the file and read line by line.
with open('./data/enwiki-20150602-words-frequency.txt',
          'r', encoding='utf-8') as f:
    for line in f:
        # Remove the newline.
        line = line.rstrip('\n')
        # The words and counts are separated by a single space.
        (word, count) = line.split(' ')
        # Store the word and its count.
        words.append(word)
        word_counts.append(int(count))
print(' Done.')
Let's look at a few basic properties of the data before applying it to negative sampling.
There appear to be 1.9M words in the vocabulary, and 1.5B total word occurrences!
import numpy as np
# Convert the list to a numpy array so we can do math on it easily.
word_counts = np.asarray(word_counts)
# Total words in the training set.
# Make sure to sum with int64, otherwise it will overflow!
total_words = np.sum(word_counts, dtype=np.int64)
print('Number of words in vocabulary: {:,}\n'.format(len(words)))
print('Total word occurrences in Wikipedia: {:,}\n'.format(total_words))
Just out of curiosity, here are the most frequent and least frequent words.
print('The 10 most frequent words:\n')
print(' --Count-- --Word--')
# For the first ten word counts...
for i in range(10):
    # Print the count with commas, and pad it to 12 characters.
    print('{:>12,} {:}'.format(word_counts[i], words[i]))
print('The 10 least frequent words:\n')
print(' --Count-- --Word--')
# Go backwards through the last 10 indices...
for i in range(-1, -11, -1):
    # Print the count with commas, and pad it to 12 characters.
    print('{:>12,} {:}'.format(word_counts[i], words[i]))
When given a list of words representing a sentence, the word2vec algorithm goes through each word and rolls some dice to decide whether to keep each word or cut it from the sentence.
The following equation gives the probability of keeping word $ w_i $, where $ z(w_i) $ is the fraction of all word occurrences in the corpus that are word $ w_i $.
$ P(w_i) = (\sqrt{\frac{z(w_i)}{0.001}} + 1) \cdot \frac{0.001}{z(w_i)} $
The value 0.001 is the 'sample' parameter, which can be modified. Smaller values of 'sample' make frequent words less likely to be kept / more likely to be tossed.
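As a quick sanity check, we can plug a hypothetical frequency into the formula. Here's a small sketch (the 1% frequency is made up for illustration): a word making up 1% of the corpus would be kept only about 42% of the time, and the break-even point where the formula returns exactly 1.0 can be solved for analytically.

```python
import math

def keep_probability(z, sample=0.001):
    """Probability of keeping a word that makes up fraction z of the corpus."""
    return (math.sqrt(z / sample) + 1) * (sample / z)

# A hypothetical word making up 1% of all occurrences is kept ~42% of the time.
print('z = 0.0100 ->  P(keep) = {:.2f}'.format(keep_probability(0.01)))

# Setting the formula equal to 1 and solving gives z = sample * (3 + sqrt(5)) / 2,
# roughly 0.0026 with the default sample of 0.001. Rarer words are always kept.
z_threshold = 0.001 * (3 + math.sqrt(5)) / 2
print('z = {:.4f} ->  P(keep) = {:.2f}'.format(z_threshold, keep_probability(z_threshold)))
```

Any word whose frequency falls below that threshold gets a "probability" above 1.0 from the formula, which just means it's never discarded.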
Let's try applying this formula to our Wikipedia word data to see what happens.
# Divide each word's occurrence count by the total number of words to get z(w_i).
word_fractions = word_counts / float(total_words)
# Coefficient to control how aggressively we subsample.
coeff = 0.001
# Let's calculate the probabilities for all words.
#
# Numpy is handling a lot for us here. It knows that when we divide
# an array by a scalar (like word_fractions / coeff) we mean to divide
# each element by the scalar.
# And when we multiply two arrays together, it knows to multiply them
# element-wise. (If we had wanted the dot product
# instead, we would use np.dot.)
word_sample_rates = (np.sqrt(word_fractions / coeff) + 1) * (coeff / word_fractions)
Now that we have the sampling probabilities calculated for all words, let's see what the probabilities are for the most common words.
print("--Rank-- --Word-- --Probability to keep--")
# Go through the words (from most frequent down) until we find
# the first word which will always be kept.
for i in range(len(words)):
    print('%4d %10s %.0f%%' % (i, words[i], word_sample_rates[i]*100.0))
    # Stop when we go above 100%.
    if word_sample_rates[i] > 1.0:
        break
These results surprised me. For this dataset, the subsampling equation (with coefficient 0.001) only subsamples the top 27 words--mostly stop words. Perhaps this is the intended behavior, or perhaps the Google News dataset had a different word frequency distribution.
As we'll see in the next section, though, this subsampling does have a significant impact on the size of the training set.
With the sampling rates calculated, we can now calculate the expected reduction in size of the training set. Note that this sampling is not a pre-processing step, but rather a decision that's made on each word as we go through the training data. The training dataset is not modified, either--each training pass will remove different words, and the actual number of words removed will vary. The calculation below is just an average.
# Limit the sample rates to a max value of 1.0.
wsr = np.clip(word_sample_rates, 0.0, 1.0)
# Multiply each word's count by its sample rate, and sum the result
# to get the expected number of sampled words.
sampled_words = np.dot(word_counts, wsr)
# Report the result.
print("\nOf {:,} words in the training set, on average {:,} will be sampled.".format(total_words, int(sampled_words)))
# By what percent did we reduce the training set?
percent_sampled = (total_words - sampled_words) / total_words * 100.0
print("\nThis is a %.2f%% reduction in size." % (percent_sampled))
Below is some code for taking an example sentence and displaying the sampling probabilities for each word.
# Let's define this as a function so we can re-use it on other sentences.
def subsample_sentence(sentence, word_sample_rates):
    print("\n%8s %15s %s" % ('--Rank--', '--Word--', '--Probability to keep--'))
    # For each word in the sentence...
    for word in sentence:
        # Find the word's index (its frequency rank).
        i = words.index(word)
        # Cap the probability at 100%.
        word_prob = min(word_sample_rates[i], 1.0)
        print('%8d %15s %.0f%%' % (i, word, word_prob*100.0))
# Let's try a classic example.
sentence = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
subsample_sentence(sentence, word_sample_rates)
# And now something arbitrary.
sentence = ['below', 'is', 'some', 'code', 'for', 'taking', 'an', 'example', 'sentence', 'and', 'displaying', 'the', 'sampling', 'probabilities', 'for', 'each', 'word']
subsample_sentence(sentence, word_sample_rates)
Recall that when training the word2vec neural network, we only update the weights for a small number of "negative" words (that is, words which are not in the input word's context). This is referred to as negative sampling.
The negative samples are selected at random, but with a probability tied to the word's frequency in the training data. That is, more common words are selected more often as negative samples.
The most straightforward sampling for negative words would be to select them with the same probability as their occurrence in the training data:
$ P(w_i) = \frac{ f(w_i) }{\sum_{j=0}^{n}\left( f(w_j) \right) } $
That is, the probability of selecting word 'i' is just the number of occurrences of word i over the total number of word occurrences in the corpus.
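To make the selection step concrete, here is a small sketch of drawing negative samples in proportion to frequency, using np.random.choice with a tiny made-up vocabulary (the words and counts are invented for illustration, not taken from the Wikipedia data):

```python
import numpy as np

# Toy vocabulary with made-up occurrence counts.
toy_words = ['the', 'cat', 'sat', 'aardvark']
toy_counts = np.array([1000, 50, 40, 1])

# Probability of selecting each word is its count over the total.
toy_probs = toy_counts / toy_counts.sum()

# Draw 5 negative samples -- 'the' will dominate the draws,
# while 'aardvark' will almost never be selected.
rng = np.random.default_rng(42)
negative_samples = rng.choice(toy_words, size=5, p=toy_probs)
print(negative_samples)
```

The word2vec C implementation does this more efficiently with a large pre-computed "unigram table," but the sampling distribution is the same idea.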
Let's plot the word counts to see what this distribution would look like.
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# For each word, calculate the probability that you would choose it
# at random from the training set.
word_probs = word_counts / float(total_words)
# The x-axis will just be the word "ranking".
# Because the words are all sorted by frequency, a
# word's index in the list is also its rank.
word_ranks = range(0, len(words))
sns.set(style='darkgrid')
# Plot the word counts on a logarithmic scale.
plt.yscale('log')
plt.plot(word_ranks, word_probs)
plt.xlabel('Word Rank')
plt.ylabel('Word Probability')
plt.title('Probability of selecting each word')
plt.show()
It looks like the elbow of the plot is somewhere around word 125k, so let's check out that region.
It's interesting to see that these are still (mostly) reasonable words.
print('Words near elbow at 125,000:\n')
print('%30s --Count--' % '--Word--')
# Print ten words starting at rank 125,000.
for i in range(125000, 125010):
    print('%30s %d' % (words[i], word_counts[i]))
The word2vec authors found that a slight modification to this function produced the best results. They raised each word count to the 3/4 power:
$ P(w_i) = \frac{ {f(w_i)}^{3/4} }{\sum_{j=0}^{n}\left( {f(w_j)}^{3/4} \right) } $
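Before applying this to the full dataset, a toy two-word example (with made-up counts) shows what the 3/4 power does: it compresses the gap between frequent and rare words, shifting probability mass toward the rare word.

```python
import numpy as np

# Made-up counts: one very common word, one rare word.
counts = np.array([1000000.0, 100.0])

# Plain unigram probabilities.
unigram = counts / counts.sum()

# Counts raised to the 3/4 power, then renormalized.
counts_34 = counts ** 0.75
unigram_34 = counts_34 / counts_34.sum()

print('Unigram:     ', unigram)
print('Unigram^3/4: ', unigram_34)
```

In this toy case the rare word's selection probability grows roughly tenfold, while the common word's shrinks slightly, which is exactly the flattening we'll see in the plot below.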
Let's calculate these modified probabilities, and plot them together with the unigram distribution to see the difference.
# Raise all word counts to the 3/4 power.
word_counts_34 = word_counts**0.75
# Sum these to get the denominator.
total_words_34 = np.sum(word_counts_34)
# Calculate the word probabilities
word_probs_34 = word_counts_34 / total_words_34
# Create a new figure and retrieve the axis object so we can
# plot both distributions together.
fig = plt.figure()
ax1 = fig.add_subplot(111)
# Plot both distributions.
ax1.plot(word_ranks, word_probs, c='b', label='Unigram')
ax1.plot(word_ranks, word_probs_34, c='r', label='Unigram^3/4')
# Show the probabilities in log scale.
plt.yscale('log')
# Add the legend and label the plot.
plt.legend()
plt.xlabel('Word Rank')
plt.ylabel('Word Probability')
plt.title('Comparison of distributions')
plt.show()
We can see that, on the very far left, the modified distribution decreases probability for the most frequent words. But for the majority of the words, the modified distribution increases their probability.
Out of curiosity, where does it flip?
for i in range(len(word_probs)):
    if word_probs[i] < word_probs_34[i]:
        print("'%s', ranked %d, is the first word which has a " \
              "higher probability with the modified distribution." % (words[i], i))
        print("This means probability is decreased for only " \
              "the top %.2f%% of the vocabulary." % (float(i) / float(len(word_probs)) * 100.0))
        print("The word has unigram probability %G, " \
              "and probability %G with the modified distribution." % (word_probs[i], word_probs_34[i]))
        print("The word occurs {:,} times in the corpus.\n".format(word_counts[i]))
        break
If you load the Google News model in gensim (as we did in Chapter 1), you can look at, for example, model.vocab["hello"].count. I assumed this was the word occurrence count that we want, but it's actually just the word's ranking. So if you sort the 3M words in the Google News vocabulary by this "count" property, you'll just get the correct ranking of the words, but not the actual number of occurrences.
The below code demonstrates this.
%%time
import gensim
filepath = './data/GoogleNews-vectors-negative300.bin'
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format(filepath, binary=True)
# Get the list of words in the vocabulary.
words = model.vocab.keys()
# Create a corresponding list of the count for each word.
word_ranks = []
for word in words:
    word_ranks.append(model.vocab[word].count)
# Sort both lists by the word counts, descending (most frequent first).
# Please forgive the python nastiness! We're combining the lists,
# sorting by the counts, then splitting them back into two separate
# lists.
word_ranks, words = map(list, zip(*sorted(zip(word_ranks, words), reverse=True)))
print('\nThe 10 most frequent words:\n')
print(' --Rank-- --Word--')
# For the 10 most frequent words...
for i in range(10):
    # Print the rank with commas, and pad it to 12 characters.
    print('{:>12,} {:}'.format(word_ranks[i], words[i]))
print('\nThe 10 least frequent words:\n')
print(' --Rank-- --Word--')
# Go backwards through the last 10 indices...
for i in range(-1, -11, -1):
    # Print the rank with commas, and pad it to 12 characters.
    print('{:>12,} {:}'.format(word_ranks[i], words[i]))